Abstract

Within a Kuhn-Tucker cavity method introduced in a former paper, we study optimal
stability learning for situations where, in the replica formalism, the replica symmetry may
be broken, namely

(i) the case of a simple perceptron above the critical loading, and 

(ii) the case of two-layer AND-perceptrons, if one learns with maximal stability. 

We find that the deviation of our cavity solution from the replica symmetric one in these 
cases is a clear indication of the necessity of replica symmetry breaking. In any case the 
cavity solution tends to underestimate the storage capabilities of the networks. 

1 Introduction 

In a recent paper [1], we introduced a new kind of cavity method, with which we could
solve the learning problem for perceptrons with Q- and Q'-state Potts model input
and output neurons. In this method, the Kuhn-Tucker conditions, which lead to optimal
stability in AdaTron-type learning processes, are built into the cavity formulation.
In subsequent papers we extended this method to the problem of the generalization
ability of a perceptron trained for optimal stability [?], and to the problem of storing
correlated patterns.

In the present paper, we apply our method to cases where in the replica formalism 
the replica symmetry is broken, namely (i) to perceptrons with Ising neurons above the 
critical loading and (ii) to two-layer AND perceptrons. 

Cavity ideas were first applied to neural networks by Mezard [?] and by Kinzel and
Opper [?]. The approach of Griniasty [?], whose work will be discussed below in
comparison with our own findings, employs ideas introduced by Mezard. Our formulation of
the cavity method, on the other hand, was inspired by the just-mentioned work of Kinzel
and Opper. In another original approach, Wong employs ideas which are related to
ours and Griniasty's.

*based on the PhD thesis of F. Gerl, Regensburg 1994 
2 Simple perceptrons above the critical loading

We use the definitions as in [?], which are simpler than those we had to introduce for the
Potts model case [1]. Our perceptron has input neurons $k = 1, \ldots, N$, whose possible
states are $S_k = \pm 1$, and one output neuron $s'$, also with possible states $\pm 1$. The couplings
leading from the input neurons to the output neuron are assumed to be real numbers,
which are collected into the coupling vector $\mathbf{J} := (J_1, \ldots, J_N)$ with the Euclidean length
$L := |\mathbf{J}| = (J_1^2 + \ldots + J_N^2)^{1/2}$, which is kept fixed, while the components $J_k$ are adapted
to certain tasks by "training processes" (see below). The relation between input and
output is

$$ s' = \mathrm{sign}\Big( \sum_{k=1}^{N} J_k S_k \Big) , \tag{1} $$

which can be considered as a binary classification of the possible inputs, or as answers
to questions.

Now we assume that there is a training set of $p$ input-output pairs, where the inputs
are $\xi^\mu := (\xi_1^\mu, \ldots, \xi_N^\mu)$ for $\mu = 1, \ldots, p$, and the corresponding desired outputs are $\zeta^\mu$.
Here and in the following, unless otherwise stated, we assume that all input components
and the outputs are independent random numbers, which take the values $\pm 1$ with equal
probability.

The optimization task, which the perceptron then has to fulfill in the course of the training
by adaptation of the components of the coupling vector $\mathbf{J}$, is the minimization of a
Hamiltonian, which performs a weighted count of bad classifications (see below) of the
training examples. Among these Hamiltonians is that of Gardner and Derrida [10],
which simply counts the bad classifications, namely

$$ \mathcal{H} := \sum_{\mu=1}^{p} V(E^\mu) := \sum_{\mu=1}^{p} \theta\big(\kappa - E^\mu/L\big) . \tag{2} $$

Here $\theta(x) = 1$ for $x > 0$, $= 0$ else, and $E^\mu$ is defined as usual as the "oriented field"
acting on pattern $\mu$,

$$ E^\mu = \zeta^\mu \sum_{k=1}^{N} J_k \xi_k^\mu , \tag{3} $$

while $L = |\mathbf{J}|$ as above. Bad classifications are those where $E^\mu/L$ is $< \kappa$, i.e. for
$\kappa > 0$ they are not necessarily wrong but lack a prescribed amount of stability, which is
measured by $\kappa$.
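As a concrete illustration of eqs. (2) and (3), the following minimal Python sketch evaluates the oriented fields and the Gardner-Derrida count for a given coupling vector. The function names and the toy patterns are illustrative assumptions and are not taken from the original work.

```python
import math


def oriented_fields(couplings, patterns, outputs):
    """Oriented fields E^mu = zeta^mu * sum_k J_k xi_k^mu, cf. eq. (3)."""
    return [
        zeta * sum(j * s for j, s in zip(couplings, xi))
        for xi, zeta in zip(patterns, outputs)
    ]


def gardner_derrida_cost(couplings, patterns, outputs, kappa):
    """Number of patterns with E^mu / L < kappa, cf. eq. (2)."""
    length = math.sqrt(sum(j * j for j in couplings))
    fields = oriented_fields(couplings, patterns, outputs)
    return sum(1 for field in fields if field / length < kappa)


if __name__ == "__main__":
    # Hypothetical toy data: N = 2 inputs, p = 3 patterns.
    J = [1.0, 1.0]
    xi = [[1, 1], [1, -1], [-1, -1]]
    zeta = [1, 1, 1]
    for kappa in (0.0, 1.0, 2.0):
        print(kappa, gardner_derrida_cost(J, xi, zeta, kappa))
```

In this toy case the normalized fields are $\sqrt{2}$, $0$ and $-\sqrt{2}$, so raising the required stability $\kappa$ turns more and more patterns into "bad classifications" even though only one of them is actually misclassified.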

There are $p =: \alpha \cdot N$ such "questions with prescribed answers", i.e. $\mu = 1, \ldots, p$, and
it is assumed that the loading parameter $\alpha$ remains finite, while $N \to \infty$ and $p \to \infty$.
Furthermore, the stability parameter $\kappa$ is to be maximized, if error-free classification
(i.e. $\mathcal{H} = 0$) is possible.

Gardner and Derrida [10] evaluated not only the number of errors $\mathcal{H}$ above the critical
loading $\alpha_c(\kappa)$, where for positive $\kappa$ error-free classification is no longer possible, but
they also tried to evaluate the so-called Almeida-Thouless line $\alpha_{AT}(\kappa)$. Above this line,
within the replica formalism, replica symmetry breaking (RSB) is necessary. Surprisingly,
Gardner and Derrida found in [10] that $\alpha_{AT} > \alpha_c$ for $\kappa > 0$. However, this was due to a
subtle integration error recently discovered by Bouten [?], who proved that
$\alpha_{AT}(\kappa) = \alpha_c(\kappa)$, as expected. Bouten also showed that replica symmetry is always broken if the
distribution of local fields possesses a gap. So these results, on which we will comment
later, provide a test for our cavity approach.

Other Hamiltonians, on which we comment at a later stage, are (see [10, 11]) the
Perceptron function and the AdaTron function with

$$ V(E^\mu) = \big[\kappa - (E^\mu/L)\big]^x \cdot \theta\big[\kappa - (E^\mu/L)\big] \tag{4} $$

with $x = 1$ and $x = 2$, respectively.



2.1 Representation of optimal couplings 

Instead of minimizing the weighted error rate $f(\alpha, \kappa) := \mathcal{H}/p$, one can of course also
maximize $\kappa(\alpha, f)$ for the given training set. Since we are above the critical loading $\alpha_c(\kappa)$,
$f \cdot p$ of the training examples are badly classified. There is however an exponentially large
number $\mathcal{N}_f$ of partitions of the training set into a "good fraction" and a "waste fraction"
of size $(1 - f) \cdot p$ and $f \cdot p$, respectively, from which one has to choose the optimal one.
Namely, $\mathcal{N}_f \sim \exp(c \cdot p)$, with $c = -f \ln f - (1 - f) \ln(1 - f)$. We nevertheless assume
that every one of these combinations has been trained for optimal stability, e.g. with the
AdaTron algorithm [?]. The optimal perceptron with error rate $f$ is then given by the
partition leading to maximal $\kappa$.
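The exponential growth of the number of partitions can be checked directly. The sketch below (function names are illustrative assumptions) compares the exact logarithm of the binomial coefficient with the entropy estimate $c \cdot p$ quoted above.

```python
import math


def partition_entropy(f):
    """c = -f ln f - (1 - f) ln(1 - f), the exponent per pattern."""
    if f <= 0.0 or f >= 1.0:
        return 0.0
    return -f * math.log(f) - (1.0 - f) * math.log(1.0 - f)


def log_partition_count(p, f):
    """Exact ln of the number of ways to choose the f*p waste patterns."""
    waste = round(f * p)
    return math.log(math.comb(p, waste))


if __name__ == "__main__":
    f = 0.2
    for p in (100, 1000, 10000):
        exact = log_partition_count(p, f) / p
        print(p, exact, partition_entropy(f))
```

The exact value per pattern approaches $c$ from below as $p$ grows; the remaining difference is the usual logarithmic Stirling correction, which is irrelevant in the thermodynamic limit.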

The couplings of a perceptron trained for optimal stability can always be expressed
in the form [?]

$$ J_k = \frac{1}{N} \sum_{\mu \in \{(1-f)p\}} x^\mu \zeta^\mu \xi_k^\mu , \tag{5} $$

with the so-called "embedding strengths" $x^\mu$ of patterns which do not belong to the $f p$
badly classified patterns. As can be shown using Lagrangian multipliers [12, 1], these
embedding strengths have to fulfill the so-called Kuhn-Tucker conditions, see below.
Without restriction of generality, these are usually formulated by fixing the length $L$ of
the coupling vector $\mathbf{J}$ in such a way that the stability limit for $\kappa > 0$ corresponds to
$E^\mu = 1$, i.e. $L = \kappa^{-1}$. With this convention, which we always use in the following, unless
otherwise stated, the Kuhn-Tucker conditions are:

$$ \text{either } (x^\mu > 0 \text{ and } E^\mu = 1) \quad \text{or} \quad (x^\mu = 0 \text{ and } E^\mu > 1) . \tag{6} $$

In fact, the AdaTron algorithm (without overrelaxation)

$$ \delta x^\mu = \max\{-x^\mu,\; 1 - E^\mu\} \quad \text{(sequentially or in parallel)} \tag{7} $$

simply fixes the $x^\mu$ repeatedly to values which fulfill (6): If it converges, the conditions
are necessarily obeyed.
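The sequential variant of update rule (7) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the oriented correlation matrix is normalized by $1/N$ so that its diagonal equals 1, the fields are computed as $E^\mu = \sum_\nu B^{\mu\nu} x^\nu$, and all function names, the sweep limit and the tolerance are hypothetical choices.

```python
import random


def make_patterns(p, n, rng):
    """Random +-1 inputs xi[mu][k] and desired outputs zeta[mu]."""
    xi = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(p)]
    zeta = [rng.choice((-1, 1)) for _ in range(p)]
    return xi, zeta


def correlation_matrix(xi, zeta):
    """Oriented correlations B[mu][nu] = zeta^mu zeta^nu (xi^mu . xi^nu) / N."""
    p, n = len(xi), len(xi[0])
    return [
        [zeta[m] * zeta[v] * sum(a * b for a, b in zip(xi[m], xi[v])) / n
         for v in range(p)]
        for m in range(p)
    ]


def adatron(B, sweeps=5000, tol=1e-10):
    """Sequential AdaTron: x^mu += max(-x^mu, 1 - E^mu), one pattern at a time."""
    p = len(B)
    x = [0.0] * p
    for _ in range(sweeps):
        largest_step = 0.0
        for m in range(p):
            field = sum(B[m][v] * x[v] for v in range(p))
            step = max(-x[m], 1.0 - field)
            x[m] += step
            largest_step = max(largest_step, abs(step))
        if largest_step < tol:  # all steps negligible: conditions (6) hold
            break
    return x


def stability(x, n):
    """kappa = 1/L with L^2 = (1/N) sum_mu x^mu, valid once (6) is fulfilled."""
    return (sum(x) / n) ** -0.5


if __name__ == "__main__":
    rng = random.Random(3)
    n, p = 40, 20  # loading alpha = 0.5, well below the critical value 2
    xi, zeta = make_patterns(p, n, rng)
    x = adatron(correlation_matrix(xi, zeta))
    print("stability kappa =", stability(x, n))
```

Because each step sets one pattern exactly onto one of the two branches of (6), a sweep in which no embedding strength changes any more is a fixed point that satisfies all Kuhn-Tucker conditions simultaneously.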

Using the "oriented correlation" matrix

$$ B^{\mu\nu} = \frac{1}{N}\, \zeta^\mu \zeta^\nu \sum_{k=1}^{N} \xi_k^\mu \xi_k^\nu \tag{8} $$

and the definitions (3) and (5), we can write for the oriented field $E^\mu$

$$ E^\mu = \sum_{\nu} B^{\mu\nu} x^\nu . \tag{9} $$

With the Kuhn-Tucker conditions we have finally

$$ L^2 = \frac{1}{N} \sum_{\mu,\nu} x^\mu B^{\mu\nu} x^\nu = \frac{1}{N} \sum_{\mu} x^\mu . \tag{10} $$

The basic idea of the cavity method here is to add a pattern, i.e. the "cavity", to a
set of perfectly trained patterns. By calculating the necessary adjustments to embed this
pattern we gain valuable information about the whole system. As in [?] we therefore add
a new "question with desired answer" $(\xi^0, \zeta^0)$ to the training set, assuming one simple
ground state.

We note that the distribution of the oriented fields $\tilde E^0$ acting on it, before any further
adaptation has been performed, which in the following always will be indicated by the
tilde, is Gaussian with average 0 and variance $|\mathbf{J}|^2 = L^2 = \kappa^{-2}$. Of course the error rate $f$
has to remain constant. Therefore, as will be seen below, the best strategy is to give
up and add the new pattern to the "waste fraction of the training set", if the oriented
field $\tilde E^0$ acting on $\xi^0$ is smaller than a certain number $Z < 1$. For self-consistency, $Z$ is
determined by

$$ f = \int_{-\infty}^{Z} d\tilde E^0\, P(\tilde E^0) = \int_{-\infty}^{z} Dt , \tag{11} $$

where $t := \tilde E^0/L$ and $z := Z/L$, and $Dt := dt\, e^{-t^2/2}/\sqrt{2\pi}$ is the Gaussian measure.
On the other hand, if $\tilde E^0$ is $> 1$, then it is not necessary
to embed $\xi^0$ in the couplings, since it can be added to the set of those correctly classified
training patterns which need no explicit embedding by the AdaTron algorithm (see [1, 12]).
Thus, only for $Z < \tilde E^0 < 1$, i.e. $z < t < \kappa$, the new pattern must be embedded
in the couplings, and the embedding strengths of the other patterns must be
corrected by $\delta x^\mu$ (see below), to compensate for the influence of the new pattern.

As we have just seen, the parallel AdaTron algorithm would in a first step try to
embed $\xi^0$, if necessary, with the "bare" embedding $x^0 = 1 - \tilde E^0$. This generates a
perturbation $B^{\mu 0} x^0$ of the pattern $\mu$. In a second parallel step, all those patterns $\mu$
which are stored explicitly then have to respond by $\delta x^\mu = -B^{\mu 0} x^0$ to the disturbance
by $x^0$, because the Kuhn-Tucker conditions still have to be fulfilled. At pattern $\xi^0$ these
corrections generate a response field

$$ g\, x^0 = \sum_{\mu\,(x^\mu > 0)} B^{0\mu}\, \delta x^\mu = - \sum_{\mu\,(x^\mu > 0)} (B^{0\mu})^2\, x^0 , \tag{12} $$

which reduces the effect of the AdaTron step with $x^0$. Therefore, one has to enhance
$x^0 = 1 - \tilde E^0$ by an amplification factor $1/(1+g)$ $(> 1)$. Now the $(B^{0\mu})^2$ are $\approx 1/N$ on
average, see (8). Therefore one gets immediately

$$ \sum_{\mu\,(x^\mu > 0)} (B^{0\mu})^2 = \alpha \cdot P(x^\mu > 0) =: \alpha_{\rm eff} . \tag{13} $$

$\alpha_{\rm eff}$ is the percentage of exhausted degrees of freedom, i.e., if pattern $\xi^0$ is as typical as
the other random patterns $\mu = 1, \ldots, p$, one has to postulate

$$ g = -\alpha_{\rm eff} = -\alpha \cdot P(Z < \tilde E^0 < 1) = -\alpha \int_{z}^{\kappa} Dt \; > (-1) . \tag{14} $$
Note that we have constructed our algorithm in such a way that the further correction
steps, which are necessary to regain the Kuhn-Tucker conditions exactly, do not change
the response at pattern $\xi^0$ in the thermodynamic limit when convergence is achieved. A
proof is given in the Appendix, where we also show that in this limit one can assume
that the $x^\mu$ and $B^{0\mu}$ are statistically independent.

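With our approach, the final embedding strength for $\xi^0$ is then given by $(1 - \tilde E^0)/(1 + g)$.
Furthermore, we now identify the distribution of embedding strengths of all patterns
with that of pattern $\xi^0$. Putting the Gaussian distribution of the cavity field $\tilde E^0$, felt
before training, into (10) gives

$$ L^2 = \frac{\alpha L}{1+g} \int_{z}^{\kappa} Dt\, (\kappa - t) . \tag{15} $$

With $L = \kappa^{-1}$ this implies

$$ 1 = \frac{\alpha \kappa}{1+g} \int_{z}^{\kappa} Dt\, (\kappa - t) . \tag{16} $$

After multiplication with $1 + g$ this finally leads to our "Kuhn-Tucker cavity result"

$$ 1 = \alpha \int_{z}^{\kappa} Dt + \alpha \kappa \int_{z}^{\kappa} Dt\, (\kappa - t) \tag{17} $$

$$ \phantom{1} = \alpha\, (1 + \kappa^2) \int_{z}^{\kappa} Dt + \frac{\alpha \kappa}{\sqrt{2\pi}} \Big( e^{-\kappa^2/2} - e^{-z^2/2} \Big) . \tag{18} $$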



This result will now be compared with that of the cavity approaches of Griniasty [?]
and Wong [?]. Their different result is equivalent to the replica formalism in the
replica-symmetric approximation [?, 10]; therefore we use the suffix "RS" to indicate
their result. We have already shown in [1] that below $\alpha_c$, where replica symmetry is
exact, our Kuhn-Tucker cavity approach and the RS approach agree; however in the
present situation, where $\alpha$ is $> \alpha_c$, they disagree.

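When Griniasty derives the constant of integration in [?], he uses a simple method
which in all cases known to us gives the correct RS result: The reaction factor $g$ is
assumed to vanish, while at the same time also the $B^{\mu\nu}$ with $\mu \ne \nu$ are neglected. With
these two neglections one obtains instead of eqn. (10)

$$ L^2 = \frac{1}{N} \sum_{\mu} (x^\mu)^2 . \tag{19} $$

Thus, instead of eqs. (17) and (18), the "RS" cavity result would be

$$ 1 = \alpha_{RS} \int_{z}^{\kappa} Dt\, (\kappa - t)^2 = \alpha_{RS}\, (1 + \kappa^2) \int_{z}^{\kappa} Dt + \frac{\alpha_{RS}\, \kappa}{\sqrt{2\pi}} \Big( e^{-\kappa^2/2} - e^{-z^2/2} \Big) + \underbrace{\frac{\alpha_{RS}\, (z - \kappa)}{\sqrt{2\pi}}\, e^{-z^2/2}}_{=:\, M} . \tag{20} $$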



Eqn. (20), which is identical with the result obtained by Griniasty [?] or Wong [?],
agrees with the result of the replica calculation of Gardner and Derrida [10] from the
replica-symmetric approximation; the expression abbreviated by $M$ in eqn. (20) yields
the difference between our $\alpha_{cav}$ ($= \alpha$ obtained from eqn. (18)) and $\alpha_{RS}$. Because of $z < \kappa$,
it is $M < 0$ and therefore $\alpha_{RS} > \alpha_{cav}$. For $z \to -\infty$, i.e. for $f \to 0$, one has $M = 0$, and
thus for $\alpha < \alpha_c$ it is $\alpha_{RS} = \alpha_{cav}$, as already mentioned.

Although the numerical results differ, there is an interesting formal relationship
between the basic equations in Wong's approach and the one presented here: Eqn. (6)
in [?] agrees formally with our eqn. (16), and our reaction strength $g$ corresponds to
$-\alpha \cdot \chi$ there. However, Wong's $-\alpha \cdot \chi$ differs from our $g$, which is given by eqn. (14),
by a term corresponding just to the expression $M$ in eqn. (20). The difference arises
from a $\delta$-function contribution to $\lambda'(t)$ at $t = z$, where $\lambda(t)$ and $t$ in Wong's notation are the
oriented fields after training and before training, respectively. Probably this difference,
which only comes into play above $\alpha_c$, where $M$ is $\ne 0$, is relevant with respect to the
combinatorial explosion, which conflicts with the assumption of a unique optimum made
in all the approaches. Research on this problem is in progress.

In Fig. 1, for error rates of $f = 0.2$ and $0.02$, the learning capacities $\alpha(\kappa, f) = \alpha_{cav}$ and
$\alpha_{RS}$, respectively, as obtained from our Kuhn-Tucker cavity theory, eqn. (17), and with
the replica-symmetric approximation (20), respectively, are plotted against the stability
$\kappa$. Obviously, the learning capacity obtained with the Kuhn-Tucker cavity theory is
lower, particularly at small values of $\kappa$; e.g. for $\kappa = 0$ and $f = 0.2$, $\alpha_{cav}$ is only 10/3,
whereas $\alpha_{RS}$ is 15.53. This means at the same time that for given $\kappa$ and $\alpha$, our error
rate $f$ would be larger than that obtained with eqn. (20). This is a strong hint that the
assumption of a unique optimum for the distribution of those patterns which are put to
waste is wrong. Therefore, one cannot say in advance which approach is better. Rather,
what has been gained is the following: Since both approaches agree below $\alpha_c(\kappa, f = 0)$,
but not for $f > 0$, we can use the different results as a criterion for the necessity of
replica-symmetry breaking for $f > 0$, which agrees with the recent rigorous proof of [?].
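The two capacities quoted for $\kappa = 0$ and $f = 0.2$ can be reproduced numerically. The sketch below is an illustration under stated assumptions: the waste threshold is taken as $z = \Phi^{-1}(f)$, the cavity capacity follows from $1 = \alpha_{cav} \int_z^\kappa Dt\, [1 + \kappa(\kappa - t)]$, the RS capacity from $1 = \alpha_{RS} \int_z^\kappa Dt\, (\kappa - t)^2$, and the Gaussian integrals are written in closed form. The function names are hypothetical.

```python
from statistics import NormalDist

STD = NormalDist()  # standard Gaussian: pdf, cdf and inverse cdf


def _pieces(kappa, f):
    """Shared Gaussian integrals over z < t < kappa, with cdf(z) = f."""
    z = STD.inv_cdf(f)                    # requires 0 < f < 1
    mass = STD.cdf(kappa) - f             # int_z^kappa Dt
    first = STD.pdf(z) - STD.pdf(kappa)   # int_z^kappa Dt t
    return z, mass, first


def alpha_cav(kappa, f):
    """Cavity capacity: 1 = alpha * int_z^kappa Dt [1 + kappa (kappa - t)]."""
    _, mass, first = _pieces(kappa, f)
    return 1.0 / ((1.0 + kappa ** 2) * mass - kappa * first)


def alpha_rs(kappa, f):
    """RS capacity: 1 = alpha * int_z^kappa Dt (kappa - t)^2."""
    z, mass, first = _pieces(kappa, f)
    # int_z^kappa Dt t^2 = z pdf(z) - kappa pdf(kappa) + mass
    second = z * STD.pdf(z) - kappa * STD.pdf(kappa) + mass
    return 1.0 / (kappa ** 2 * mass - 2.0 * kappa * first + second)


if __name__ == "__main__":
    print(alpha_cav(0.0, 0.2))  # 10/3
    print(alpha_rs(0.0, 0.2))   # approximately 15.53
```

For $\kappa = 0$ the cavity expression reduces to $\alpha_{cav} = 1/(1/2 - f)$, which makes the value 10/3 immediate, whereas the RS value is dominated by the small second moment of the truncated Gaussian.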



2.2 Comparison with a One-Step-RSB calculation

Majer et al. [11] have performed a calculation above $\alpha_c(\kappa, f = 0)$ within a one-step
replica-symmetry-breaking approximation. From eqn. (2), they obtained the following
result for the free energy $\langle e \rangle = \alpha f(\alpha, \kappa)$:



$$ \langle e \rangle = \min_{x,\, q_0,\, w} \Big\{ \ldots\, \frac{\ldots}{2x(1 + w \Delta q)} + \frac{\ldots}{w x} \int Dz_0\, \ln \Big[ \ldots + \exp\{-w x\}\, \Phi(\ldots) \ldots \Big] \Big\} \tag{21} $$

with $\Delta = \kappa - z_0 \sqrt{q_0}$ and $\Delta q = 1 - q_0$. The parameters $q_0$, $w$ and $x$ have to be chosen
such that the number of errors is maximized. In the limit $q_0 \to 1$ one regains the RS
result.

In Fig. 2 the stability $\kappa$ is plotted against the error rate $f$ for three different values
of $\alpha$. Obviously there is only a small difference between the results of the RS and the
1-step RSB calculation, in contrast to our cavity results, which differ considerably from
both replica calculations, i.e. with our estimate, the stability increases much more slowly
with increasing $\alpha > \alpha_c$.

For $f \ll 1$, we can compare these different estimates with simple simulations: For
$\alpha < 2$ one can train perceptrons to optimal stability by means of the AdaTron algorithm.
If one then skips the pattern with the largest embedding strength, i.e. the one which
was most difficult to store, and re-learns the remaining patterns, one gets an enhanced
stability, which agrees within the error limit with the replica calculation, and not with
the cavity results, as we found in the simulations.
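The simulation protocol just described can be sketched as follows. This is only a small illustration under stated assumptions (sequential AdaTron on the oriented correlation matrix normalized by $1/N$, stability read off from $\kappa = 1/L$ with $L^2 = N^{-1} \sum_\mu x^\mu$); system size, seed and function names are hypothetical, and a quantitative comparison with the theories would need much larger systems and sample averages.

```python
import random


def adatron(B, sweeps=5000, tol=1e-10):
    """Sequential AdaTron on a correlation matrix B with unit diagonal."""
    p = len(B)
    x = [0.0] * p
    for _ in range(sweeps):
        largest_step = 0.0
        for m in range(p):
            field = sum(B[m][v] * x[v] for v in range(p))
            step = max(-x[m], 1.0 - field)
            x[m] += step
            largest_step = max(largest_step, abs(step))
        if largest_step < tol:
            break
    return x


def stability(x, n):
    """kappa = 1/L with L^2 = (1/N) sum_mu x^mu."""
    return (sum(x) / n) ** -0.5


def drop_hardest_experiment(n=40, p=20, seed=1):
    """Train, discard the pattern with the largest x^mu, and retrain."""
    rng = random.Random(seed)
    xi = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(p)]
    zeta = [rng.choice((-1, 1)) for _ in range(p)]
    B = [[zeta[m] * zeta[v] * sum(a * b for a, b in zip(xi[m], xi[v])) / n
          for v in range(p)] for m in range(p)]
    x = adatron(B)
    kappa_before = stability(x, n)
    hardest = max(range(p), key=lambda m: x[m])
    keep = [m for m in range(p) if m != hardest]
    B_reduced = [[B[m][v] for v in keep] for m in keep]
    kappa_after = stability(adatron(B_reduced), n)
    return kappa_before, kappa_after


if __name__ == "__main__":
    before, after = drop_hardest_experiment()
    print("kappa before:", before, "after:", after)
```

Removing an explicitly embedded pattern relaxes one active constraint, so the retrained stability can only go up; the interesting quantity in the comparison is by how much.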

Thus, for $f > 0$, our Kuhn-Tucker cavity method yields an insufficient approximation
for $\alpha(\kappa, f)$. In the following we try to find the reasons for the discrepancy and to
estimate at the same time the quality of the 1-step RSB calculation. For this purpose,
we need the distribution of the oriented fields $t$, which according to Majer et al. [11] is

$$ P(t) = \int Dz_0\, \frac{1}{\mathcal{M}} \int Dz_1\, \exp\Big\{ -w x \Big( V(\lambda_0) + \frac{(\lambda_0 - z_0 \sqrt{q_0} - z_1 \sqrt{\Delta q})^2}{2x} \Big) \Big\}\, \delta(t - \lambda_0) , \tag{22} $$

where $V$ has been defined for the Gardner-Derrida rule in (2), and where $\lambda_0$ has to be
chosen such that the exponent is minimized.

The parameters $q_0$, $w$ and $x$ are determined for given $\alpha$ and $\kappa$ from eqn. (21). With
eqn. (22) the probability distribution of the local fields can be determined; in particular a
$\delta$-peak at $t = 1$ is obtained, which represents the patterns which are embedded explicitly
by the training. However, our interest is in the field distribution before the additional
training.

This distribution can be obtained by extending Griniasty's interpretation from the
RS results in [?] to the present case.

According to [?], after addition of the pattern $\xi^0$ the quantities $z_0 \sqrt{q_0} + z_1 \sqrt{\Delta q}$ and
$\lambda_0$ are the local oriented fields felt before and after the additional training, respectively.

The exact value of $\lambda_0$ results from a compromise between the increase of the energy
$V(\lambda_0)$ for patterns which are badly classified by the perceptron, and the term
$(\lambda_0 - z_0 \sqrt{q_0} - z_1 \sqrt{\Delta q})^2 / 2x$ representing the increase in energy of the $(1 - f) p$ patterns which
had been stored before the addition of the new pattern.

Thus we can determine the field before the corrections by simply eliminating $V(\lambda_0)$
and $\lambda_0$ in (21): With the abbreviations $u_0 = z_0 \sqrt{q_0}$ and $u_1 = z_1 \sqrt{\Delta q}$, the integrand in
eqn. (22) is

$$ \int Dz_1 \Big[ \exp\Big\{ -\frac{w}{2} (\kappa - u_0 - u_1)^2 \Big\}\, \theta(\kappa - u_0 - u_1)\, \theta\big(u_0 + u_1 - (\kappa - \sqrt{2x})\big) + \exp\{-w x\}\, \theta\big(\kappa - \sqrt{2x} - (u_0 + u_1)\big) + \theta(u_0 + u_1 - \kappa) \Big]\, \delta\big(t - (u_0 + u_1)\big) $$

$$ = \frac{1}{\sqrt{2\pi \Delta q}} \exp\Big[ -\frac{(t - u_0)^2}{2 \Delta q} \Big] \Big[ \exp\Big\{ -\frac{w}{2} (\kappa - t)^2 \Big\}\, \theta(\kappa - t)\, \theta\big(t - (\kappa - \sqrt{2x})\big) + \exp\{-w x\}\, \theta(\kappa - \sqrt{2x} - t) + \theta(t - \kappa) \Big] . \tag{23} $$

Here the first term describes the patterns which have been successfully embedded
(i.e. with $V = 0$), the second term those which are badly classified and put to the waste
(i.e. with $V = 1$ and $\lambda_0 - z_0 \sqrt{q_0} - z_1 \sqrt{\Delta q} = 0$), and the final term the patterns which are
stored without embedding. The denominator $\mathcal{M}$ in (22) is the integral over all fields $t$ in
(23) and serves for normalization. Thus, we have no longer a Gaussian field distribution
before learning.

In the cavity picture, this non-Gaussian distribution of the fields acting on an
untrained pattern stems from the fact that there are now many ground states available. If
we add a pattern to such a multitude, every ground state will still see a Gaussian
distribution of the local field. However, the particular ground state which after training
appears as the one with the highest stability is likely to have a higher-than-average local
field for the new pattern, thus being able to store it more easily. Our field distribution
before learning is then affected by this selection. At the end of this section we will briefly
comment on how an intrinsic cavity approach for RSB should take this selection effect
into account.

In Fig. 3, for $\kappa = 1$ and $\alpha = 1$, the local fields are presented for the three
approximations mentioned, namely for our cavity method, and for the RS and RSB1
approximations. One can see that in RSB1, the distribution of the local fields before learning the
additional pattern, although strongly non-Gaussian, is still everywhere continuous, and
for $t = \kappa$ even continuously differentiable.

Comparing the results of our cavity theory with the RS approximation, one can see
that for increasing $\alpha > \alpha_c$

• on the one hand, the error rate $f$ obtained with the RS theory is much better
than that obtained with our cavity method (if the error rate obtained with RSB1
is considered as target approximation), see Fig. 2; whereas

• on the other hand, the value of the limiting negative field value $t := t_0 = z/\kappa$
($\approx -0.5$ in Fig. 3), below which patterns are no longer learnable, is almost exactly
the same with our simple cavity approach and the much more complicated RSB1
calculation.

If we accept the local fields from RSB1 as a good approximation, then again the
Kuhn-Tucker conditions (6) must be fulfilled after learning, and with (17) we can calculate
the loading $\alpha_{RSB1-KT}$ from the RSB1 field distribution, calculated with the parameters
$q_0$, $w$ and $x$. For this purpose, in the above-mentioned formula one only has to replace
the Gaussian measure $Dt$ by the RSB1 field distribution of those patterns which are
explicitly embedded, i.e. from $t_0$ to $\kappa$ ($= 1$) in Fig. 3. This is related to $\alpha_{RSB1}$ in a similar
way as $\alpha_{cav}$ is related to $\alpha_{RS}$.

For three values of $\kappa$, the result $f_{KT} := f_{RSB1-KT}$ of such a calculation is presented
in Fig. 4. For comparison also the results of the two simple approximations, $f_{cav}$ and
$f_{RS}$, are presented, together with the RSB1 replica result $f_{RSB1}$. Obviously, the improved
cavity result $f_{KT}$ is only slightly higher than $f_{RSB1}$, which is already a criterion for the
quality of the RSB1 calculation compared with the RS theory.

One expects therefore that the rigorous result should lie between our $f_{KT}$ as an upper
and $f_{RSB1}$ as a lower bound, so that further replica symmetry breaking steps should give
only a slight improvement. Precisely, we expect both a slow monotonic increase of the
error rate $f_{RSBn}$ with increasing $n$ in an RSBn calculation [14], and a slow monotonic
decrease of $f_{RSBn-KT}$, and at the same time a decrease of the relative number of explicitly
embedded patterns and of the averaged embedding strength. As a consequence, $\alpha_{RSBn-KT}$,
which is calculated from a formula analogous to (17), increases slowly. For $n \to \infty$
the RSBn replica result and the RSBn-KT cavity result derived from the RSBn field
distribution should agree.

We have thus found a method leading to an upper bound for the error rate $f(\alpha, \kappa)$ as
a function of $\alpha$ and $\kappa$, if the local fields before re-learning are known. In [?], Fontanari
and Theumann derive an upper bound for $f$ at $\kappa = 0$, evaluated in RS approximation at
the AT line for finite temperatures: For $\alpha < 50$, our RSB1-KT upper bound is lower,
whereas for $\alpha > 50$ the bound in Fig. 2 of [?] is lower.

That the RSB1 itself is not yet sufficient has already been shown by Majer and Engel
[11]. Moreover, also the fact that after our RSB1-KT cavity learning there is still a gap
in the field distribution, which is according to Bouten responsible for RSB, points to
this fact. Additionally, in this approximation the condition is violated that the number
of explicitly embedded patterns should not exceed the number $N$ of coupling degrees of
freedom, see eqn. (13).

In connection with Wong's paper [?], it is interesting to note the following: Using
the RSB1 distribution of fields in connection with eqn. (6) in [?] one reproduces
self-consistently the RSB1 result for $\alpha$ of Majer and Engel [11]. The slight difference to our
result arises again from a contribution of a $\delta$-peak, which is present in Wong's approach,
but not in ours. This contribution, much smaller now, which in turn originates from the
mentioned gap, leads again to terms corresponding formally to $M$ in eqn. (20). Thus
once more the difference between our results and those obtained with Wong's approach,
both starting with the above-mentioned RSB1 field distribution, shows that also the
RSB1 calculation, albeit a good approximation, is not yet exact.

The results discussed here have recently been supported by a complicated 2-step
RSB calculation: Whyte and Sherrington find in [15] that in fact the next step in the
replica breaking scheme raises the error rate $f$, but only by an amount which is typically
0(10"^). This agrees very well with the predictions that we could make from our findings. 

As already mentioned, the results of the cavity methods are necessarily insufficient
above $\alpha_c$, since the combinatorial possibilities to select the "waste patterns" are not
considered. However, with our approach it should also be possible to get equations
which are equivalent to RSB1, without using the replica trick, i.e. as in the seminal book
[16] on spin glasses.

To achieve this, one considers a multitude of ground states, which are ordered in an
ultrametric structure. If one decreases the stability constraints, the number of different
ground states is assumed to increase exponentially. If one now adds a pattern to this
ensemble, a lower embedding strength is therefore favoured, which gives a higher storage
capacity compared to the Gaussian one of our eqn. (17). This approach has the appealing
trait that the number of patterns which have to be misclassified gets lower as more
degrees of freedom become available for the approximation. Within the replica method,
in contrast, one has to take the "worst" value of the order parameters for technical
reasons. Work on an intrinsic cavity approach for $\alpha > \alpha_c$ is in progress.



3 The AND-machine 

For multilayer perceptrons, replica symmetry breaking is a general phenomenon, when
optimal learning capacity for a given stability or optimal stability for given capacity
are required. As a consequence, the optimal capacity is not easily estimated. Griniasty
and Grossman [17] have calculated the storage capacity of the AND-machine and gave
arguments which made a suppression of replica symmetry breaking credible. However,
with our method we are able to check their proposal and find that replica symmetry is
broken.



3.1 Model description 

The AND-machine is a simple example of a multi-layer perceptron. Multilayer
perceptrons have been introduced to overcome the limitations of single-layer perceptrons, which
are limited to linearly separable classifications [18], whereas with sufficiently many units
in the additional layer(s) between the input and output units, every Boolean function
can be implemented. However, at the same time, the analytical treatment of the
models becomes much more complicated, and in general, the space of solutions is no longer
simply connected. The RS approximation then yields results which contradict the exact
bounds derived by Mitchison and Durbin [19]. However, the necessary RSB calculation
is rather complicated for such models. A multilayer model which has been studied in
this way is the so-called committee machine with 3 (or any other odd number $N_h$ of)
hidden units [?]. Here the intermediate hidden layer consists of three neurons, and
the output is given by the majority vote of these hidden neurons. The maximal capacity
of the system decreases from $\alpha_c \approx 4.02$ for the RS approximation to $\alpha_c \approx 3.0$ for the
RSB ansatz. This is for the case of non-overlapping receptive fields, see Fig. 5.

In contrast to the mentioned case of the committee machine, for the AND machine
the number of intermediate neurons is arbitrary ($\ge 2$); here the output unit gives a
positive vote iff all intermediate neurons vote with +1. This AND machine was at first
treated by Griniasty and Grossman, both with non-overlapping and also with
overlapping receptive fields (NRF and ORF cases, respectively). Within a replica-symmetric
approach, the authors find for the case of an equal number of patterns with output +1
and -1 for $N_h = 2$ intermediate neurons a critical capacity of $\alpha_c \approx 3.5$ ($\alpha_c \approx 3.3$ with
simulations) for the case of overlapping receptive fields, and $\alpha_c \approx 3.66$ for the NRF case
[17]. At the end of their paper, Griniasty and Grossman discuss the validity of their RS
approximation and find support for their assumption that only a single minimum
contributes to their solution. Griniasty [?] repeated the calculation with his cavity method
and got the same results. Wong's recent result on this problem will be commented
upon below. In the following, we use our different cavity method.



3.2 Calculation of the learning capacity 

As we have seen, our method yields a convenient way to recognize the necessity of RSB.
Additionally one gets a rough quantitative estimate of the extent to which the actual solution
is approximated. At first we study the AND machine with $N_h = 2$ and non-overlapping
receptive fields (NRF case). Since this does not necessitate an additional effort and the
argumentation is simplified, we assume that both subnetworks are trained with optimal
stability, as long as the optimal capacity $\alpha_c$ is not yet reached.

We assume $p_+ = \alpha_+ N$ and $p_- = \alpha_- N$ patterns with positive resp. negative output.
Then the parameter $b$ is defined via

$$ \alpha_\pm = \alpha \cdot \frac{1 \pm b}{2} . \tag{24} $$

The (+)-patterns must be trained in both subnets, since both intermediate neurons
must vote positively, whereas for the (-)-patterns one can choose at will a subnet with
a negative vote. From this fact, Griniasty and Grossman [17] derive the sharp bounds

$$ \frac{6}{2 + b} \le \alpha_c \le \frac{4}{1 + b} \tag{25} $$

for the maximal capacity $\alpha_c$, which also apply with overlapping receptive fields (ORF
case). Moreover, recently Wendemuth was able to generalize the method of Mitchison
and Durbin [19] and obtained for the AND-machine as a lower bound the sharper
condition

$$ \frac{\ldots}{4 - (1 - b) \log_2(3)} \ge \alpha_c , \tag{26} $$

which also applies both to the NRF and the ORF case.

Let us consider fictitiously all possibilities to distribute the responsibility for the
(-)-patterns among the subnetworks; these patterns, including the (+)-patterns, shall
then be trained to optimal stability. Afterwards we select that distribution which leads
to maximal stability $\kappa = \min\{\kappa_1, \kappa_2\}$.

In both subnetworks $i = 1, 2$ we have $N$ input neurons, and in both subnetworks the
couplings are defined with embedding strengths $x_i^\mu$ as

$$ J_{ik} = \frac{1}{N} \sum_{\mu} x_i^\mu \zeta^\mu \xi_{ik}^\mu . \tag{27} $$

Here $\zeta^\mu$ is the desired final output. As before, we define the oriented correlation
matrix $B_i$, the length $L_i$ of the coupling vectors and the oriented local fields $E_i^\mu$:

$$ B_i^{\mu\nu} := \frac{1}{N}\, \zeta^\mu \zeta^\nu \sum_{k} \xi_{ik}^\mu \xi_{ik}^\nu , \tag{28} $$

$$ L_i^2 := \mathbf{J}_i^2 = \frac{1}{N} \sum_{\mu,\nu} x_i^\mu B_i^{\mu\nu} x_i^\nu , \tag{29} $$

$$ E_i^\mu := \sum_{\nu} B_i^{\mu\nu} x_i^\nu . \tag{30} $$

The embedding strengths $x_i^\mu$, for $i = 1, 2$ and all $\mu$, together with the $E_i^\mu$, fulfill again
the KT conditions

$$ \text{either } (x_i^\mu > 0 \text{ and } E_i^\mu = 1) \quad \text{or} \quad (x_i^\mu = 0 \text{ and } E_i^\mu > 1) . \tag{31} $$

Additionally, we know that for our optimal choice, a (-)-pattern can have a positive
embedding strength only at one of the subnets, since otherwise the less stable subnet
could enhance the stability by reducing its unnecessarily positive $x_i^\mu$ to 0 and relearning
the other patterns. Furthermore, in the thermodynamic limit, both stabilities should be
equal, i.e. $L_1^2 = L_2^2 =: L^2$. Let us assume again that there is only one ground state, and
add a new pattern. Again this feels a normally distributed random oriented field $\tilde E_i^0$ with
variance $L^2$ in each subnet. A (+)-pattern must be classified correctly in both subnets.
So we need, in case of $\tilde E_i^0 < 1$, positive embedding strengths $x_i^0 = (1 - \tilde E_i^0)/(1 + g_i)$,
where the response factors $g_i$ have still to be determined. In contrast, for a (-)-pattern
we can choose the subnet with negative intermediate output. As we will see below, it
is best to embed the (-)-pattern in that subnet where the oriented field is larger, so
that the embedding strength can be smaller. The field distribution $P(t)$ for the larger
oriented field $t$, with normalized couplings ($\tilde E^0_{\max} = L\, t$), can be calculated from the
Gaussian distribution of the fields $(t_1, t_2)$ of the two subnets as follows:

$$ P(t) = 2\, P(t_1 = t,\; t_2 < t) = \frac{2}{\sqrt{2\pi}}\, e^{-t^2/2} \int_{-\infty}^{t} Dt_2 = \frac{2}{\sqrt{2\pi}}\, e^{-t^2/2}\, \Phi(t) , \tag{32} $$

with $\Phi(t) := \int_{-\infty}^{t} Dz$.
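The distribution (32) of the larger of two independent Gaussian fields is easy to check numerically. In the following sketch (sample size, seed and function names are illustrative assumptions) the density $2\, \varphi(t)\, \Phi(t)$, with $\varphi$ the standard Gaussian density, is compared with its cumulative form $\Phi(t)^2$ and with a direct Monte Carlo estimate.

```python
import random
from statistics import NormalDist

STD = NormalDist()  # standard Gaussian


def density_of_max(t):
    """P(t) = 2 * pdf(t) * cdf(t) for the larger of two unit Gaussians."""
    return 2.0 * STD.pdf(t) * STD.cdf(t)


def cdf_of_max(t):
    """Both fields lie below t with probability cdf(t)^2."""
    return STD.cdf(t) ** 2


def sampled_fraction_below(t, samples=200000, seed=0):
    """Monte Carlo estimate of Prob(max(t1, t2) < t)."""
    rng = random.Random(seed)
    hits = sum(
        1 for _ in range(samples)
        if max(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) < t
    )
    return hits / samples


if __name__ == "__main__":
    print(cdf_of_max(0.0))              # exactly 1/4
    print(sampled_fraction_below(0.0))  # close to 1/4
```

The shift of the weight towards larger fields is what makes it advantageous to assign a (-)-pattern to the subnet with the larger cavity field: at $\kappa = 0$ only a quarter of these patterns need any embedding at all, instead of one half.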

If again we assume that $x^\mu$ and $B^{0\mu}$ are uncorrelated, then the answer $g$ of the patterns
which had already been implemented and now must keep the KT conditions is similar
to eqn. (14), namely

$$ g = - \sum_{\mu\,(x^\mu > 0)} (B^{0\mu})^2 = -\alpha_{\rm eff} = -\alpha\, P(x^\mu > 0) . \tag{33} $$

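Identifying again the distribution of the embedding strengths with the probability
distribution for $x^0$ and normalizing the couplings to 1, we get

$$ g = -\alpha\, \frac{1 + b}{2} \int_{-\infty}^{\kappa} Dt \;-\; \alpha\, \frac{1 - b}{2} \int_{-\infty}^{\kappa} Dt\, \Phi(t) . \tag{34} $$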



Here the first and second parts describe the influence of (+)- and (-)-patterns, whereby compared with eqn. (^) a factor 1/2 was taken into account, since only one of the two subnets is needed for (-)-patterns. The capacity for finite stabilities is again calculated from the KT conditions (^) through

$L_i^2 = \sum_k J_{ik}^2 = \sum_\mu x_i^\mu E_i^\mu = \sum_\mu x_i^\mu$ (35)

$\quad = \alpha \int dx_i\, x_i\, w(x_i) = \frac{L}{1+g} \int_{-\infty}^{\kappa} Dt\, (\alpha_+ + \alpha_- \Phi(t)) \cdot (\kappa - t)$ . (36)

Again, with $\kappa = 1/L$ and multiplying by $(1 + g)$, we finally obtain from our cavity method

$1 = \alpha \int_{-\infty}^{\kappa} Dt \left( \frac{1+b}{2} + \frac{1-b}{2}\, \Phi(t) \right) \cdot (1 + \kappa(\kappa - t))$ . (37)

Obviously, every other strategy to store a pattern would lead to a smaller learning capacity with our method. The maximal capacity for $\kappa = 0$ is then with our method (i.e. with $\alpha = \alpha_{\rm cav}$)

$\frac{1}{\alpha_{\rm cav}} = \int_{-\infty}^{0} Dt \left( \frac{1+b}{2} + \frac{1-b}{2}\, \Phi(t) \right) = \frac{5 + 3b}{16}$ . (38)

This corresponds again to the limit $g = -1$ of eqn. ( p^ ) and is compatible with the exact bounds (25), although it is somewhat lower than the improved lower bound (26). This, again, is no surprise, since we always expect a somewhat too low estimate from our method in case of RSB situations.
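The cavity capacity of the tree AND-machine can be evaluated numerically. The sketch below assumes that the cavity condition reads $1 = \alpha \int_{-\infty}^{\kappa} Dt\, [(1+b)/2 + (1-b)/2\, \Phi(t)]\,(1 + \kappa(\kappa - t))$, which for $\kappa = 0$ gives $\alpha_{\rm cav} = 16/(5+3b)$; the function names are illustrative only.

```python
import math

def phi(t):
    """Standard normal density (the weight of the measure Dt)."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def Phi(t):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def integrate(f, a, b, n=20000):
    """Midpoint rule on [a, b] with n cells."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

def alpha_cav(kappa, b, lower=-10.0):
    """Capacity from the assumed cavity condition 1 = alpha * I(kappa, b)."""
    def integrand(t):
        mix = 0.5 * (1.0 + b) + 0.5 * (1.0 - b) * Phi(t)  # (+)- and (-)-patterns
        return phi(t) * mix * (1.0 + kappa * (kappa - t))
    return 1.0 / integrate(integrand, lower, kappa)

for b in (-0.5, 0.0, 0.5, 1.0):
    print(b, alpha_cav(0.0, b), 16.0 / (5.0 + 3.0 * b))
```

For $b = 1$ (only (+)-patterns) the value reduces to 2, the capacity of a single perceptron, since every pattern must then be stored in both subnets.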

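Now, in contrast, we perform the "handwaving approach" mentioned by Griniasty, see 0, of neglecting the non-diagonal terms of $B$ while assuming at the same time $g = 0$. Instead of eqn. (37) this leads to "RS" results, namely

$1 = \frac{1}{L_{{\rm RS},i}^2} \sum_\mu x_{{\rm RS},i}^\mu B^{\mu\mu} x_{{\rm RS},i}^\mu = \frac{1}{L_{{\rm RS},i}^2} \sum_\mu \left( x_{{\rm RS},i}^\mu \right)^2 = \alpha_{\rm RS} \int_{-\infty}^{\kappa} Dt \left( \frac{1+b}{2} + \frac{1-b}{2}\, \Phi(t) \right) \cdot (\kappa - t)^2$ . (39)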



For $\kappa = 0$ this yields, instead of eqn. (38), the result of Griniasty and Grossman, [17],
namely

= ^ + 0.045422528... . (40) 
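The RS numbers quoted in this section can be reproduced with a short script. This is a sketch under the assumption that the RS condition reads $1 = \alpha_{\rm RS} \int_{-\infty}^{\kappa} Dt\, [(1+b)/2 + (1-b)/2\, \Phi(t)]\,(\kappa - t)^2$; the constant 0.045422528... then appears as $\int_{-\infty}^{0} Dt\, t^2\, \Phi(t) = 1/8 - 1/(4\pi)$. All function names are illustrative.

```python
import math

def phi(t):
    """Standard normal density (the weight of the measure Dt)."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def Phi(t):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def integrate(f, a, b, n=20000):
    """Midpoint rule on [a, b] with n cells."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

def alpha_rs(kappa, b, lower=-10.0):
    """Capacity from the assumed RS condition 1 = alpha * int Dt (...) (kappa - t)^2."""
    def integrand(t):
        mix = 0.5 * (1.0 + b) + 0.5 * (1.0 - b) * Phi(t)
        return phi(t) * mix * (kappa - t) ** 2
    return 1.0 / integrate(integrand, lower, kappa)

# second moment of the negative part of the larger field, per unit weight
moment = integrate(lambda t: phi(t) * t * t * Phi(t), -10.0, 0.0)
print(moment, 0.125 - 1.0 / (4.0 * math.pi))
print(alpha_rs(0.0, 0.0))  # compare with the quoted RS value 3.667 for b = 0
```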

These results are for non-overlapping receptive fields (NRF). For finite $\kappa$ and $b = 0$, the maximal capacity $\alpha_c$ is compared for both approximations in fig. 6. In addition, the result for overlapping receptive fields (ORF, $\alpha_{\rm full}$, see below) is presented as the dashed curve, which is slightly lower for the present case, but not always. This will be discussed later in more detail.

The fact that the RS result differs from our cavity result is, according to our experience, a strong hint for the necessity of RSB. The extent of replica symmetry breaking seems to increase with higher loading and larger percentage of (-)-patterns.

For $\alpha_+ = \alpha_-$ the maximal capacity according to our theory is $(\alpha_c)_{\rm cav} = 3.2$, whereas eqn. (26) leads to $\alpha_c > 3.31$ and the RS approximation to $(\alpha_c)_{\rm RS} = 3.667$. Probably the true result is again smaller than the RS value, but not too far from it. Thus, for $\alpha_+ = \alpha_-$, the RS approach yields again a good estimate although replica symmetry is broken: The limit $g = -1$, above which the system is obviously over-determined concerning the number of couplings, is already reached at $(\alpha_c)_{\rm cav}$, i.e. below $(\alpha_c)_{\rm RS}$.

The local stability has not been checked by replica calculations. Recently, however, 
Wong could use his cavity method [0 to check a large class of multilayer perceptrons 
and found that for the AND-machine replica symmetry indeed is broken. 

Our cavity method is not only applicable to the present case of an NRF-AND machine, but also to NRF-machines with arbitrary Boolean output functions: If one has found the optimal strategy in a similar way as above, the steps leading to eqn. (^) are identical, and one only has to substitute the embedding strengths, which compensate the normally distributed random field, in this equation.

Formally this leads to the following equation for the storage capacity $\alpha = \min_i\{\alpha_i\}$: If $t$ is the random vector describing the oriented fields of the subnets and $x_{i,0}(t)$ the embedding strength following from the optimal learning strategy for the subnet $i$ (without taking into account the response $g$), then

$1 = \alpha_i \int Dt\, \left[ \theta(x_{i,0}(t)) + \kappa\, x_{i,0}(t) \right]$ . (41)

Here the different desired outputs, analogous to the (+)- and (-)-patterns in case of the AND-machine, have to be taken into account according to their respective probabilities. The optimal storage capacity is obtained if one of the subnets has reached its capacity limit. If the output value follows from a Boolean function for which all the neurons in the intermediate layer are equivalent, then all the $\alpha_i$ are identical. The formula analogous to ( P^D , giving an upper estimate for the storage capacity, was for $\kappa = 0$ already determined by Engel et al., [21], and was described in a more abstract way in [T^. In a formulation similar to (41), this upper bound is determined from

$1 = \alpha_i \int Dt\, x_{i,0}^2(t)$ . (42)



In contrast, the results for $\alpha_c$ obtained from eqn. (^) with the cavity method are lower estimates. For the committee-machine, because of $\alpha_{\rm cav} < 2$, they violate the lower bound of Mitchison and Durbin [18], $\alpha > 2$, and are therefore without interest.

More interesting would be a RSB calculation as suggested at the end of the section on the simple perceptron for $\alpha > \alpha_c$, since then relevant estimates from below for the true capacity for this model could be derived.

3.3 The AND-machine with overlapping receptive fields 

The fully connected AND machine can also be treated with our cavity method. This corresponds to a machine as in fig. 5, but with identical patterns presented to both subnets. Thus the index $i = 1,2$ in the description of the patterns $\{\xi_i^\mu\}$, e.g. in (pTj), is now a dummy index, and in particular $B_1 = B_2 = B$. A very important parameter for the fully connected AND machine is the overlap $R := \sum_k J_{1k} J_{2k}$ of the two subnets.

Since RSB is expected for this case as well, we can no longer expect that $R$ agrees for different approaches, as it happened with the generalization problem treated in 0. Therefore we can no longer simply compare with the RS results of [17].

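For the local fields $t_1$ and $t_2$ one gets for given $R$

$P_R(t_1, t_2) = \frac{1}{2\pi\sqrt{1-R^2}} \exp\left( -\frac{t_1^2 - 2R\,t_1 t_2 + t_2^2}{2(1-R^2)} \right)$ . (43)

However, the probability density of the field $t_i$ of a (+)-pattern needing explicit embedding is not influenced, since again

$\int_{-\infty}^{\infty} dt_2\, P_R(t_1, t_2) = \frac{1}{\sqrt{2\pi}}\, e^{-t_1^2/2}$ . (44)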
For the (-)-patterns the situation is more complicated. Negative correlations of the subnets simplify the storing of these patterns, whereas positive correlations enhance the probability that a (-)-pattern, i.e. one which is already wrongly classified by one of the subnets, cannot be stored automatically by the other one, i.e. that it must be embedded explicitly. In the limit $b \to -1$, i.e. when exclusively (-)-patterns must be embedded, $J_1 = -J_2$ (with $R = -1$), for arbitrary couplings $J_1$, is a general solution of the problem in the limit $\alpha \to \infty$.

For arbitrary $b$, the field distribution of a (-)-pattern before learning is

$P_R(t) = 2P_R(t_1 = t,\, t_2 < t) = 2\int_{-\infty}^{t} dt_2\, P_R(t, t_2) = \frac{2}{\sqrt{2\pi}}\, e^{-t^2/2}\, \Phi\!\left( \frac{(1-R)\,t}{\sqrt{1-R^2}} \right)$ , (45)

and one obtains (32) as special case for $R = 0$. Again, a single groundstate is assumed, and again the arguments follow from eqn. (^) ff.
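The closed form for the larger of two correlated fields can be compared with a direct numerical integration of the bivariate Gaussian. The sketch below assumes the standard bivariate normal density with correlation $R$ and the closed form $P_R(t) = 2\varphi(t)\,\Phi\big((1-R)t/\sqrt{1-R^2}\big)$; the function names are illustrative.

```python
import math

def phi(t):
    """Standard normal density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def Phi(t):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def p_pair(t1, t2, R):
    """Bivariate normal density with unit variances and correlation R."""
    q = (t1 * t1 - 2.0 * R * t1 * t2 + t2 * t2) / (2.0 * (1.0 - R * R))
    return math.exp(-q) / (2.0 * math.pi * math.sqrt(1.0 - R * R))

def p_max_closed(t, R):
    """Assumed closed form for the density of the larger field."""
    return 2.0 * phi(t) * Phi((1.0 - R) * t / math.sqrt(1.0 - R * R))

def p_max_direct(t, R, lower=-12.0, n=20000):
    """2 * int_{-inf}^{t} dt2 P_R(t, t2), by the midpoint rule."""
    h = (t - lower) / n
    return 2.0 * h * sum(p_pair(t, lower + (i + 0.5) * h, R) for i in range(n))

for R in (-0.5, 0.0, 0.3):
    for t in (-1.0, 0.7):
        print(R, t, p_max_closed(t, R), p_max_direct(t, R))
```

For $R = 0$ the argument of $\Phi$ reduces to $t$, which recovers the uncorrelated case of the tree structure.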
For the response g and the capacity a we get 

— oo — oo 

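Similarly as for the generalization problem in section 2.3 of 0, only that solution $\alpha(R)$ is relevant for which $R$ is reproduced selfconsistently. For normalized couplings one obtains

$R = \sum_k J_{1k} J_{2k} = \sum_k \sum_{\mu,\nu} x_1^\mu \xi_k^\mu\, x_2^\nu \xi_k^\nu = \sum_{\mu,\nu} x_1^\mu B^{\mu\nu} x_2^\nu = \sum_\mu x_1^\mu E_2^\mu$

$\quad = \alpha_- \int_{-\infty}^{\kappa} dt_1 \int_{-\infty}^{t_1} dt_2\, P_R(t_1, t_2)\, \frac{\kappa - t_1}{1+g}\, t_2 \; + \; \alpha_+ \int_{-\infty}^{\kappa} dt_1 \int_{-\infty}^{\infty} dt_2\, P_R(t_1, t_2)\, \frac{\kappa - t_1}{1+g}\, \max\{\kappa, t_2\}$ . (48)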

In eqn. (48) the above-mentioned strategy for the storing of patterns was used: For the (-)-patterns the subnet with the smaller oriented field, here $t_2$, is unchanged, whereas the (+)-patterns are explicitly embedded in both subnets $i$ if the field $t_i$ is $< \kappa$. Again one multiplies with $(1+g)$ to simplify (48).

For $\kappa = 0$ it is again $g = -1$, and therefore one obtains $R$ from

$0 = \int_{-\infty}^{0} dt_1 \int_{-\infty}^{t_1} dt_2\, P_R(t_1, t_2)\, (-t_1)\, t_2 \; + \; \frac{\alpha_+}{\alpha_-} \int_{-\infty}^{0} dt_1 \int_{0}^{\infty} dt_2\, P_R(t_1, t_2)\, (-t_1)\, t_2$ . (49)
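The self-consistent overlap at $\kappa = 0$ can be computed in closed form. The sketch below rests on three assumptions that are not spelled out in this form in the text: (i) the condition for $R$ is $\alpha_- A(R) + \alpha_+ B(R) = 0$ with $A = \langle (-t_1) t_2 \rangle$ over $t_2 < t_1 < 0$ and $B = \langle (-t_1) t_2 \rangle$ over $t_1 < 0 < t_2$; (ii) $\alpha_\pm = \alpha(1 \pm b)/2$; (iii) the capacity follows from $g = -1$, i.e. $1/\alpha = (1+b)/4 + (1-b)/4 \cdot P(t_1 < 0, t_2 < 0)$. It uses the standard quadrant moment $\langle XY;\, X>0, Y>0 \rangle = [\sqrt{1-\rho^2} + \rho(\pi/2 + \arcsin\rho)]/(2\pi)$ of a bivariate normal distribution. All function names are illustrative.

```python
import math

def quadrant_moment(rho):
    """E[X*Y; X>0, Y>0] for a standard bivariate normal with correlation rho."""
    return (math.sqrt(1.0 - rho * rho)
            + rho * (0.5 * math.pi + math.asin(rho))) / (2.0 * math.pi)

def overlap_condition(R, b):
    """Assumed kappa = 0 condition for R, divided by the common factor alpha."""
    a_plus, a_minus = 0.5 * (1.0 + b), 0.5 * (1.0 - b)
    minus_term = -0.5 * quadrant_moment(R)  # region t2 < t1 < 0, (-t1)*t2 < 0
    plus_term = quadrant_moment(-R)         # region t1 < 0 < t2, (-t1)*t2 > 0
    return a_minus * minus_term + a_plus * plus_term

def solve_overlap(b, lo=-0.999, hi=0.999, steps=80):
    """Bisection for the root R(b); intended for -1 < b < 1."""
    f_lo = overlap_condition(lo, b)
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        f_mid = overlap_condition(mid, b)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def capacity_full(b):
    """Return (R, alpha) for the fully connected AND-machine at kappa = 0."""
    R = solve_overlap(b)
    both_negative = 0.25 + math.asin(R) / (2.0 * math.pi)  # P(t1 < 0, t2 < 0)
    inv_alpha = 0.25 * (1.0 + b) + 0.25 * (1.0 - b) * both_negative
    return R, 1.0 / inv_alpha

for b in (-1.0 / 3.0, 0.0, 0.5):
    print(b, capacity_full(b))
```

Under these assumptions the overlap vanishes at $b = -1/3$, where $\alpha_- = 2\alpha_+$, and the capacity there coincides with the tree value $16/(5+3b) = 4$.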
The result for $R(b)$ is presented in Fig. 7. One should note that it deviates from the result obtained by Griniasty and Grossman in [17]. From a comparison with Fig. 3a in [17] one finds that our value for $R$ is systematically larger, which, as mentioned above, has a negative effect on the storage capacity.

In Fig. 8 the capacity of the fully connected AND-machine is presented for the case that the capacity in (^) is calculated for $\kappa = 0$ with the $R$ determined from (49). For $b = 0$ we get $R = 0.217$ and $\alpha = 3.113$, as opposed to $R = 0$ and $\alpha = 3.512$ obtained by Griniasty and Grossman, [17]. Again we expect that the RS approximation gives the better estimate. However, our estimate for $b < -0.65$ is above the lower bound (^), as it should be, and in the limit $b \to -1$ one gets $\alpha_{c,{\rm full}} \to \infty$.

Apart from the fact that different capacities $\alpha_c$ are obtained with the replica and the cavity approach, there is a second hint at RSB: The replica calculation with the RS ansatz yields an overlap $R$ between the subnets which is not reproduced by our "optimal" learning algorithm. For the generalization problem, see 0, the perfect agreement of the results obtained with the two approaches was a clear hint at the correctness of the solution, and in particular at the correctness of RS. In the present case, however, one finds that the RS solution apparently yields the optimal capacity only when an additional component to the coupling vector $J$ is introduced, by which the overlap $R$ between the subnets is reduced, but the embedding of the patterns is actually weakened.

In fact, in Q], Griniasty suggests a non-trivial training process for the fully connected AND-machine, by which patterns in the non-affected subnet are unlearned, influencing in this way the correlation $R$. According to our considerations, however, this is not optimal, since after deleting the additional component one can enhance the stability or store additional patterns, using the same distribution of tasks with respect to the different patterns. Thus it is not astonishing that Griniasty with his training process obtains a smaller capacity ($\alpha_c = 3.0$) than with a stochastic algorithm, which leads to $\alpha_c = 3.3$, [17]. This discrepancy, which demands an additional component to the coupling vector, by which the subnets are decorrelated, should become smaller by a RSB calculation.



4 Discussion 

As we have stated already in [1], our cavity approach is usually technically simpler than a replica calculation. Moreover, it gives the exact result as long as calculations within the replica approach under the assumption of Replica Symmetry (RS) are correct. Here we have shown additionally that for models with Replica-Symmetry Breaking (RSB), e.g. the simple perceptron above $\alpha_c$, one gets different results, as one should, with our "Kuhn-Tucker cavity method" and the RS approximation: This would not be the case e.g. with the different cavity approximation of Griniasty p or Wong 0, since their methods are always equivalent to the replica calculation in RS approximation. However, although our cavity theory "indicates the necessity of RSB" if RS does not suffice, it is still far from being exact, since the combinatorial explosion of distributing the set of patterns into "good" patterns, which are stored, and "bad ones", which are not, is not considered.
From the results of the present paper it can be seen in detail that for cases with RSB, Griniasty's "RS-cavity theory" approach usually yields too optimistic estimates, whereas in the same cases, our Kuhn-Tucker cavity method is apparently "too pessimistic" in the error rates. However, the limiting negative field value, below which patterns are no longer learnable (e.g. $t \approx -0.5$ in Fig. 3), is well approximated by our theory, and as shown above, the theory also gives good results if one starts with the field distribution obtained by a 1-step-RSB calculation.

Moreover, the RS approximation follows for our models, except for the last-mentioned case of the fully connected AND machine, always from the "formally crude", but consistent approximation of vanishing response factor $g$ and vanishing off-diagonal elements $B^{\mu\nu}$, as stated already by Griniasty, [^. For this fact we do not yet have a deeper understanding.

With our Kuhn-Tucker cavity approach, we follow the embedding of a new pattern in detail: The newly added pattern is embedded by one single AdaTron step with an enhanced implementation strength $(1 - E^0)(1 + g)^{-1}$, where the enhancement factor $-g$ is given by eqn. (0). At the same time, the already embedded patterns get specific corrections $\delta x^\mu = O(1/\sqrt{N})$ of their implementation strengths. In the Appendix we show that for $N \to \infty$ there are no further corrections for $g$ necessary which would go beyond the one-step procedure. At the same time, we have gained in this way knowledge of the actual distribution of the embedding strengths.
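The AdaTron update and the resulting Kuhn-Tucker conditions can be illustrated for a single perceptron. The following is a minimal sketch of a sequential AdaTron with relaxation parameter `gamma`, not the authors' program; it assumes binary patterns that are already multiplied by their desired outputs, a correlation matrix $B^{\mu\nu} = N^{-1}\sum_k \xi_k^\mu \xi_k^\nu$, and fields $E^\mu = \sum_\nu B^{\mu\nu} x^\nu$. The function name `adatron` and the sizes used below are illustrative choices.

```python
import random

def adatron(patterns, gamma=1.0, sweeps=2000):
    """Sequential AdaTron sketch: dx = max(-x, gamma * (1 - E)) for each pattern in turn."""
    P, N = len(patterns), len(patterns[0])
    # correlation matrix B[mu][nu] = (1/N) * sum_k xi_k^mu * xi_k^nu
    B = [[sum(a * c for a, c in zip(pm, pn)) / N for pn in patterns]
         for pm in patterns]
    x = [0.0] * P  # embedding strengths
    E = [0.0] * P  # oriented fields E[mu] = sum_nu B[mu][nu] * x[nu]
    for _ in range(sweeps):
        for mu in range(P):
            dx = max(-x[mu], gamma * (1.0 - E[mu]))
            if dx != 0.0:
                x[mu] += dx
                for nu in range(P):
                    E[nu] += B[nu][mu] * dx
    # recompute the fields from scratch to avoid accumulated rounding errors
    E = [sum(B[mu][nu] * x[nu] for nu in range(P)) for mu in range(P)]
    return x, E

random.seed(1)
N, P = 40, 12  # loading P/N = 0.3, well below the critical loading
patterns = [[random.choice((-1, 1)) for _ in range(N)] for _ in range(P)]
x, E = adatron(patterns)
for xi, e in zip(x, E):
    print(round(xi, 4), round(e, 4))
```

At convergence every pattern either has a positive embedding strength with $E^\mu = 1$, or zero embedding strength with $E^\mu \geq 1$, which are the KT conditions used throughout the cavity argument.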

Another important point of our cavity method is the demand that the constraints which the solutions impose on the couplings are actually realizable, which means that no more degrees of freedom are fixed than are available with the given couplings. For the fully connected AND-machine an additional postulate is that the correlation of the subnets is self-consistently reproduced by the embedding strengths. Thus we can interpret our result as an estimate adapted to our training algorithm.

Although replica calculations in RS approximation do not at all take care of such details, we have found that they yield good estimates for the storage capacity of the simple perceptron above $\alpha_c$, and probably also for the AND machine. In this respect, the virtue of our approach is based on two facts:

First, it yields an independent estimate, which seems to be a lower bound for $\alpha_c$ and an upper bound for the error fraction $f(\alpha, \kappa)$.

Second, our cavity theory visualizes the "internal stresses" inherent in the RS approach and shows which quantities and order parameters depend most sensitively on the assumptions made.

Finally we repeat that our Kuhn-Tucker cavity approach, starting from field distributions which Majer and Engel, see [^, obtained in 1-step RSB for the perceptron above $\alpha_c(\kappa)$, shows that the 1-step RSB results are not yet exact, but must already be very near to the truth. This conclusion is supported by a recent 2-step RSB calculation.
Appendix

In this Appendix we show that for $N \to \infty$ our 1-step approximation for the reaction strength $g$ is exact:

In the derivation of our result for $g$ in eqn. (12), we had simply used $\delta x^\nu = -B^{\nu 0} x^0$ for the correction to the implementation strengths of those patterns $\nu$ which had already been embedded into the couplings before the addition of the test pattern 0. I.e. we had neglected the (secondary) mutual reaction of patterns with $\mu$ and $\nu \geq 1$, in contrast to the (primary) response of the patterns to the test pattern. To be more thorough, let us thus try a correction term $y^\nu$ taking the secondary and further reaction terms into account through $\delta x^\nu = -B^{\nu 0} x^0 + y^\nu$. Inserting this into the equation $\delta E^\mu := B^{\mu 0} x^0 + \delta x^\mu + \sum_{\nu(\neq\mu)=1} B^{\mu\nu} \delta x^\nu = 0$, we get iteratively

$\delta x^\mu = x^0 \cdot \left( -B^{\mu 0} + \sum_{\nu\neq\mu} B^{\mu\nu} B^{\nu 0} - \sum_{\nu\neq\mu} \sum_{\rho\neq\nu} B^{\mu\nu} B^{\nu\rho} B^{\rho 0} \pm \ldots \right)$ . (50)

Here, the 2nd and 3rd term on the r.h.s. correspond to subsequent parallel AdaTron iterations. Indices corresponding to patterns which are automatically implemented without explicit embedding are left out in the sums (which corresponds to $\alpha_{\rm eff}$ below). For the response $g$ we then have

$g = -\sum_\nu (B^{0\nu})^2 + \sum_{\nu\neq\mu} B^{0\nu} B^{\nu\mu} B^{\mu 0} - \sum B^{0\mu} B^{\mu\nu} B^{\nu\rho} B^{\rho 0} \pm \ldots$ . (51)

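We assume that there is no selection effect among the correlation matrix elements, i.e. that if a pattern is embedded explicitly this does not change the distribution of the $B^{\mu\nu}$. In eqn. (51) the first term on the r.h.s. gives

$-\sum_\nu (B^{0\nu})^2 = -\frac{1}{N^2} \sum_\nu \sum_{i,j} \xi_i^0 \xi_i^\nu \xi_j^\nu \xi_j^0 = -\alpha_{\rm eff}$ , (52)

as used in eqn. (13). The dominant contribution comes from terms such that $i = j$. For the next term we have $N$ contributing terms $i = j = k$ plus remaining terms represented by $\sum'$ below:

$\sum_{\nu\neq\mu} B^{0\nu} B^{\nu\mu} B^{\mu 0} = \frac{1}{N^3} \sum_{\nu\neq\mu} \sum_{i,j,k} \xi_i^0 \xi_i^\nu \xi_j^\nu \xi_j^\mu \xi_k^\mu \xi_k^0$ (53)

$\quad = \frac{p_{\rm eff}(p_{\rm eff} - 1)N}{N^3} + \frac{1}{N^3} \sum_{\nu\neq\mu} {\sum_{i,j,k}}' \xi_i^0 \xi_i^\nu \xi_j^\nu \xi_j^\mu \xi_k^\mu \xi_k^0$

$\quad = \alpha_{\rm eff}^2 + O(1/\sqrt{N})$ . (54)

From the last explicit summation in (51) there are just two non-zero terms, one for $i = j = k = l$, the other one for $\mu = \rho$, $i = l$ and $j = k$, giving

$-\sum B^{0\mu} B^{\mu\nu} B^{\nu\rho} B^{\rho 0} = -\frac{1}{N^4} \sum \sum_{i,j,k,l} \xi_i^0 \xi_i^\mu \xi_j^\mu \xi_j^\nu \xi_k^\nu \xi_k^\rho \xi_l^\rho \xi_l^0$ (55)

$\quad \simeq -\frac{p_{\rm eff}(p_{\rm eff}-1)(p_{\rm eff}-2)N}{N^4} - \frac{p_{\rm eff}(p_{\rm eff} - 1)N^2}{N^4} \simeq -\alpha_{\rm eff}^3 - \alpha_{\rm eff}^2$ . (56)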
Thus the term $\alpha_{\rm eff}^2$ in (51) is cancelled.

In the following we will see that the same happens for all powers of $\alpha_{\rm eff}$ that appear at some point in the series.

First we observe that the power of $\alpha_{\rm eff}$ in a term is given by the number of different pattern indices $\mu, \nu, \ldots$ we sum over. As in the example above, eqs. (55), (56), we can eliminate pattern indices by summing over equal pairs. Because of the construction of the parallel AdaTron algorithm no nearest neighbours in a sum (e.g. $\mu$, $\nu$ above) can be eliminated. Next we observe that when we eliminate a pair of pattern indices (e.g. $\mu = \rho$ above), all neuron indices $i, j, \ldots$ in between have to be put equal (e.g. $j = k$ for the 2nd term mentioned in connection with eqn. (55)). Thus trying to "join" a pattern index within a pair with a pattern index outside gives no contribution. In other words, once we have $\mu = \sigma$ in the chain $\mu\nu\rho\sigma\tau\eta\theta$, putting $\rho = \theta$ does not make any sense (see the case ...x...y...x...y... below).

The problem of enumerating the number of different ways of contributions that can appear can be solved with a little help from combinatorics [^. First we reformulate our problem: We have a sequence of $n$ symbols a, b, c, ..., for instance [abcdba], which fulfill:

• The first occurrence of every symbol must be in alphabetic order.

• The same symbol cannot occur twice consecutively.

• There is no subsequence ...x...y...x...y... unless x = y.

Here the symbols correspond to our pattern indices $\mu, \nu, \rho, \ldots$.

There is exactly one way for all symbols to be different, and there are $(n-1)(n-2)/2$ ways for exactly one pair. There are exactly two ways for 2 identities (counting the number of "=" signs needed to fix the pattern indices) in a chain of 5 symbols, namely [abaca] and [abcba]. There are 5 possible arrangements in a chain of 7 letters which permit 3 eliminations of pattern indices: [abacada], [abcbada], [abacdca], [abcdcba] and [abcbdba]. In both cases no additional identity is possible, and we see that the number $n$ of pattern indices has to be at least $2k + 1$, with $k$ being the number of identities.
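The counting rules can be checked by brute force for short chains. The sketch below enumerates all sequences whose symbols first appear in alphabetic order (encoded as integers 0, 1, 2, ...), discards those with a repeated neighbour or a crossing pattern x...y...x...y, and counts them by the number of identities $k = n - (\text{number of distinct symbols})$. The function names are illustrative.

```python
from collections import Counter

def restricted_growth(n):
    """Yield all length-n sequences whose symbols first appear in the order 0, 1, 2, ..."""
    def rec(prefix, used):
        if len(prefix) == n:
            yield tuple(prefix)
            return
        for s in range(used + 1):
            prefix.append(s)
            yield from rec(prefix, max(used, s + 1))
            prefix.pop()
    yield from rec([], 0)

def is_admissible(seq):
    """Check the second and third rule of the list above."""
    if any(a == c for a, c in zip(seq, seq[1:])):
        return False  # same symbol twice consecutively
    n = len(seq)
    for i in range(n):
        for j in range(i + 1, n):
            if seq[j] == seq[i]:
                continue
            for k in range(j + 1, n):
                if seq[k] != seq[i]:
                    continue
                if any(seq[l] == seq[j] for l in range(k + 1, n)):
                    return False  # crossing pattern x...y...x...y
    return True

def count_table(n):
    """Map k (number of identities) to the number of admissible chains of length n."""
    counts = Counter()
    for seq in restricted_growth(n):
        if is_admissible(seq):
            counts[n - len(set(seq))] += 1
    return dict(counts)

for n in range(1, 8):
    print(n, sorted(count_table(n).items()))
```

For $n = 5$ this reproduces the two chains with two identities, and for $n = 7$ the five chains with three identities listed above.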

The number $(n, k)$ of cases where one has $n$ letters and $k$ identities can be calculated by using a generating function. For small values of $n$ and $k$ the numbers are shown in table 1. Let us define a function $y(w, x)$ of two real variables $w$ and $x$ given by the power series

$y(w, x) = \sum_{n=1}^{\infty} \sum_{k=0}^{[(n-1)/2]} (n, k)\, w^k x^n = x + x^2 + (1 + w)x^3 + (1 + 3w)x^4 + \ldots$ . (57)

We can obtain the defining equation for y by recursion: If we remove the first letter in 
the sequence, the possibilities are that 

• there is nothing left 

• the letter does not appear again, and there are no restrictions imposed on the rest 
of the sequence 

• the letter appears in the rest of the sequence, and because of the last condition above 
we now have two sequences left (e.g. [bcb] and [ada] in the example [abcbada]). 
These contributions give the terms $x$, $xy$ and $wxy^2$ respectively. Thus we have for the defining equation

$y = x(1 + y + wy^2)$ . (58)



Using the Bürmann-Lagrange series [26] for inversion on this equation tells us that

$\sum_{k=0}^{[(n-1)/2]} (n, k)\, w^k = \frac{1}{n!} \left. \frac{\partial^{n-1}}{\partial t^{n-1}} (1 + t + wt^2)^n \right|_{t=0}$ . (59)

This can be written in a more convenient way:

$n \cdot (n, k)$ = the coefficient of $w^k t^{n-1}$ in $(1 + t + wt^2)^n$ . (60)

Thus we have

$((1 + t) + wt^2)^n = \sum_{k=0}^{n} \binom{n}{k} w^k t^{2k} \sum_{l=0}^{n-k} \binom{n-k}{l} t^l$ , (61)

and putting $2k + l = n - 1$ to get the coefficient of $w^k t^{n-1}$ we arrive at

$(n, k) = \frac{1}{n} \binom{n}{k} \binom{n-k}{k+1}$ . (62)

If we put $w = 1$, we sum the rows in table 1, which gives the defining equation $y = x(1 + y + y^2)$ for the Motzkin numbers 1 1 2 4 9 21 51 127 .... The last numbers in lines $n = 1, 3, 5, \ldots$ are the Catalan series.
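The closed form can be cross-checked against the defining equation of the generating function. This sketch assumes the closed form $(n,k) = \frac{1}{n}\binom{n}{k}\binom{n-k}{k+1}$ and compares it with coefficients generated recursively from $y = x(1 + y + wy^2)$; the function names are illustrative.

```python
from math import comb

def arrangements(n, k):
    """Assumed closed form (n, k) = C(n, k) * C(n - k, k + 1) / n."""
    return comb(n, k) * comb(n - k, k + 1) // n

def table_from_recursion(n_max):
    """Coefficients of w^k x^n in y, generated from y = x * (1 + y + w * y * y)."""
    coeff = {(1, 0): 1}
    for n in range(2, n_max + 1):
        for k in range((n - 1) // 2 + 1):
            total = coeff.get((n - 1, k), 0)  # contribution of the term x*y
            for m in range(1, n - 1):         # contribution of the term w*x*y*y
                for j in range(k):
                    total += coeff.get((m, j), 0) * coeff.get((n - 1 - m, k - 1 - j), 0)
            coeff[(n, k)] = total
    return coeff

def row(n):
    """Row n of table 1: all (n, k) for k = 0 .. [(n-1)/2]."""
    return [arrangements(n, k) for k in range((n - 1) // 2 + 1)]

recursion = table_from_recursion(10)
for n in range(1, 9):
    print(n, row(n), sum(row(n)))
```

The row sums printed in the last column are the Motzkin numbers, and the last entry of each odd row is a Catalan number.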

We remember that $(n, k)$ is the coefficient of $\alpha_{\rm eff}^{n-k}$ in the $n$-th iteration of the AdaTron algorithm. We want to prove that in the thermodynamic limit the first term is already sufficient; thus we have to show that $\sum_k (-1)^k (n + k, k) = 0$ for $n \geq 2$. Substituting $w/x$ for $w$ in the defining equation moves the column for each $k$ in the table up $k$ steps from the beginning. After rearranging we have $y = x(1 + y)/(1 - wy)$. To produce an alternating sum we now use $w = -1$ and arrive at $y = x + 0 + 0 + \ldots$, just as we wanted.
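The cancellation along the diagonals of table 1 can be verified directly. The sketch assumes the closed form $(n,k) = \frac{1}{n}\binom{n}{k}\binom{n-k}{k+1}$ and sums $(-1)^k (n+k, k)$ over $k = 0, \ldots, n-1$, the range allowed by $n + k \geq 2k + 1$; the function names are illustrative.

```python
from math import comb

def arrangements(n, k):
    """Assumed closed form (n, k) = C(n, k) * C(n - k, k + 1) / n."""
    return comb(n, k) * comb(n - k, k + 1) // n

def diagonal_sum(n):
    """Alternating sum over the diagonal that collects one fixed power of alpha_eff."""
    return sum((-1) ** k * arrangements(n + k, k) for k in range(n))

for n in range(1, 8):
    terms = [arrangements(n + k, k) for k in range(n)]
    print(n, terms, diagonal_sum(n))
```

Only the diagonal $n = 1$ survives, which corresponds to the single term $-\alpha_{\rm eff}$ of the one-step response.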

However, after $n$ such iterations of our simple parallel AdaTron algorithm there is still a "tail" with powers of $\alpha_{\rm eff}$ ranging from $[(n-1)/2]$ to $n$, which have not yet been eliminated. For $\alpha_{\rm eff} \gtrsim 1/3$ this tail (and the results of the simple AdaTron algorithm) will oscillate with increasing amplitude. But if we now introduce as usual an overrelaxation parameter $\gamma$ small enough, these oscillations are damped out and the modified AdaTron algorithm $\delta x^\mu = \max\left(-x^\mu, \gamma(1 - E^\mu)\right)$ converges for all $\alpha_{\rm eff} < 1$ [T^. At the same time we will see that our cavity response theory is correct already after the first step:

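Since we are not concerned with computational efficiency, we can choose an infinitesimally small $\gamma$, $N^{-1/2} \ll \gamma \ll 1$, after the first AdaTron step, to examine convergence of the above-mentioned tail. The number $l$ of AdaTron steps of course has to be increased in correspondence to the reduction of $\gamma$, so that the product $\gamma l$ remains finite. For the first few steps of this modified AdaTron algorithm with overrelaxation $\gamma$ after the first step, the response at pattern 0 then reads:

$g_2 = -\alpha_{\rm eff} + \gamma\,\alpha_{\rm eff}^2$

$g_3 = -\alpha_{\rm eff} + (\gamma(1-\gamma) + \gamma)\,\alpha_{\rm eff}^2 - \gamma^2 (\alpha_{\rm eff}^2 + \alpha_{\rm eff}^3)$

$g_4 = -\alpha_{\rm eff} + (\gamma(1-\gamma)^2 + \gamma(1-\gamma) + \gamma)\,\alpha_{\rm eff}^2 - (2\gamma^2(1-\gamma) + \gamma^2)(\alpha_{\rm eff}^2 + \alpha_{\rm eff}^3) + \gamma^3 (3\alpha_{\rm eff}^3 + \alpha_{\rm eff}^4)$

$g_l = -\alpha_{\rm eff} + \gamma \sum_{i=0}^{l-2} (1-\gamma)^i\, \alpha_{\rm eff}^2 - \gamma^2 \sum_{i=1}^{l-2} i\,(1-\gamma)^{i-1} (\alpha_{\rm eff}^2 + \alpha_{\rm eff}^3) + \gamma^3 \sum_{i=2}^{l-2} \binom{i}{2} (1-\gamma)^{i-2} (3\alpha_{\rm eff}^3 + \alpha_{\rm eff}^4) - \ldots$ .

Summing the geometrical series and using $(1-\gamma)^l \approx e^{-\gamma l}$ for $\gamma \to 0^+$ and $l \to \infty$, we get

$g_l \approx -\alpha_{\rm eff} + (1 - e^{-\gamma l})\,\alpha_{\rm eff}^2 - (1 - e^{-\gamma l}(1 + \gamma l))(\alpha_{\rm eff}^2 + \alpha_{\rm eff}^3) + (1 - e^{-\gamma l}(1 + \gamma l + (\gamma l)^2/2))(3\alpha_{\rm eff}^3 + \alpha_{\rm eff}^4) - \ldots$

$\quad = -\alpha_{\rm eff} + \sum_{n=2}^{\infty} (-1)^n \left( 1 - e^{-\gamma l} \sum_{m=0}^{n-2} \frac{(\gamma l)^m}{m!} \right) \sum_{k=0}^{[(n-1)/2]} (n, k)\, \alpha_{\rm eff}^{n-k}$ . (63)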

Numerically the sum in (63) decays to 0. Convergence is faster as $\alpha_{\rm eff}$ decreases, just as expected. For the critical value $\alpha_{\rm eff} = 1$, where $g \to -1$ and $\alpha \to \alpha_c$, we can examine the convergence analytically:

Collecting the series in (63) and expanding $e^{-\gamma l}$ gives for $\alpha_{\rm eff} = 1$

$g_l + 1 = \sum_{n=1}^{\infty} (-1)^{n-1} \frac{6\,(2n-1)!\,(\gamma l)^n}{((n-1)!)^2\,(n+2)!} = \gamma l \cdot {}_1F_1(3/2, 4, -4\gamma l)$ (64)

$\quad = \frac{16\gamma l}{\pi} \int_0^1 e^{-4\gamma l t} \sqrt{t}\, (1-t)^{3/2}\, dt$ . (65)

${}_1F_1$ in eqn. (64) is the confluent hypergeometric function; in eqn. (65) we introduce its integral representation. We are interested in the behaviour for $\gamma l \to \infty$ (while $\gamma \ll 1$), therefore we can replace the $(1-t)^{3/2}$ term in the integral by 1 and perform the integration analytically. Thus the additional disturbance decays like

$\lim_{\gamma l \to \infty} (g_l + 1) = \frac{1}{\sqrt{\pi \gamma l}}$ . (66)

We now have shown that even for the critical value $\alpha_{\rm eff} = 1$ the additional terms decay like $1/\sqrt{\pi\gamma l}$; for $\alpha_{\rm eff} < 1$, which are the values we are interested in, numerical examination universally shows even faster decay. Therefore, for $N \to \infty$, our 1-step-reaction approach for $g$, which is completely in the spirit of Onsager's cavity approach, is correct. This is of course also true for the multilayer perceptrons studied.
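The decay of the tail at $\alpha_{\rm eff} = 1$ can be checked numerically. The sketch below assumes the series $\sum_{n\geq 1} (-1)^{n-1}\, 6(2n-1)!\, s^n / [((n-1)!)^2 (n+2)!]$ with $s = \gamma l$ and the integral representation $(16 s/\pi)\int_0^1 e^{-4st}\sqrt{t}\,(1-t)^{3/2}\,dt$; it compares both at moderate $s$ and the integral with $1/\sqrt{\pi s}$ at large $s$. The function names are illustrative.

```python
import math

def tail_series(s, terms=60):
    """Alternating power series in s = gamma * l (numerically reliable for moderate s only)."""
    total = 0.0
    for n in range(1, terms + 1):
        num = 6 * math.factorial(2 * n - 1)
        den = math.factorial(n - 1) ** 2 * math.factorial(n + 2)
        total += (-1) ** (n - 1) * (num / den) * s ** n
    return total

def tail_integral(s, cells=20000):
    """(16 s / pi) * int_0^1 exp(-4 s t) sqrt(t) (1 - t)^(3/2) dt, using t = u^2."""
    h = 1.0 / cells
    total = 0.0
    for i in range(cells):
        u = (i + 0.5) * h
        t = u * u
        total += 2.0 * u * u * math.exp(-4.0 * s * t) * (1.0 - t) ** 1.5
    return 16.0 * s / math.pi * h * total

print(tail_series(0.5), tail_integral(0.5))
for s in (5.0, 50.0, 500.0):
    print(s, tail_integral(s), 1.0 / math.sqrt(math.pi * s))
```

The ratio of the integral to $1/\sqrt{\pi s}$ approaches 1 from below as $s$ grows, consistent with replacing $(1-t)^{3/2}$ by 1 in the asymptotic regime.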

A further study shows that the patterns which are explicitly embedded by the AdaTron algorithm have a slight negative correlation. However, this small selection effect $O(1/N)$ contributes even less than the correction effects treated above. The same is true for the selection effect introduced by the combinatorial explosion, which allows one to look for an optimal groundstate. At the level of the correlation matrix elements $B$, this again results in an effect $O(1/N)$ and can be neglected.
Figure Captions

Fig. 1: For two different error rates $f$ the storage capacity $\alpha$ is presented as a function of the stability $\kappa$ for both estimates (17) and (20).

Fig. 2: The stability $\kappa$ is presented as a function of the error rate $f$ for $\alpha$ = 2.0, 0.8 and 0.4. Results are shown for the cavity method and the replica calculation in RS and 1-step-RSB approximation.

Fig. 3: The probability density $P(t)$ of the local fields $t$ before learning is presented as a function of $t$ at $\alpha = 1$ and $\kappa = 1$. Again results are shown for the cavity method and the replica calculation in RS and 1-step-RSB approximation. The distribution after training is given by pushing the middle segment to a $\delta$-peak at $\kappa = 1$ in every case. The cavity method yields an error rate $f = 0.29092$, the replica method gives $f = 0.13073$ in RS and $f = 0.13576$ in 1-step-RSB approximation. The fraction of explicitly learned patterns is 0.55042 for the cavity approximation and 0.71062 and 0.66745 respectively for the replica calculations.

Fig. 4: The error rate $f$ is presented as a function of $\alpha$ for $\kappa$ = 1.25, 0.5 and 0. The results are given for the cavity method, the RS and 1-step-RSB approximation as well as the result $f_{\rm KT}$, which is calculated by putting together the Kuhn-Tucker conditions and the local fields of the RSB solution. The best estimate is very likely between these RSB graphs, i.e. $f_{\rm RSB}$ and $f_{\rm KT}$, and probably closer to $f_{\rm RSB}$.

Fig. 5: The AND-machine with tree structure (NRF). The weights between the intermediate layer and the output neuron are fixed. A threshold between 0 and 1 at the output neuron ensures that there is a positive output only if both intermediate neurons are positive.

Fig. 6: The storage capacity of the AND-machine is presented as a function of $\kappa$ for $b = 0$. The results for the cavity method and for the replica calculation are given. The storage capacity $\alpha_{\rm full}$ of the fully connected AND-machine, see 3.3, and $\alpha_{\rm perc}$ of the simple perceptron are presented for comparison.

Fig. 7: The overlap $R$ of both subnetworks of the fully connected AND-machine is presented as a function of the bias parameter $b$ from (24). The result for the cavity method gives a higher correlation of the subnetworks compared to Fig. 3.a in [17]. We have $R = 0$ at $b = -1/3$, which means $\alpha_- = 2\alpha_+$, in contrast to $R = 0$ at $b = 0$ in [17].

Fig. 8: The maximal storage capacity of the AND-machine is presented as a function of the bias parameter $b$ from (24). The result of the cavity method for the tree structure is compared to the RS result from [17] and the cavity result for the fully connected AND-machine. The largest deviations occur for small values of $b$. For the fully connected AND-machine the storage capacity is again smaller than with the replica calculation (see Fig. 3.b in [17]). We have $\alpha_{c\text{-full}} = \alpha_{c\text{-tree}}$ at $b = -1/3$, when $R = 0$. For $b \to -1$ we have $\alpha_{c\text{-full}} \to \infty$ as in [17].

Table 1: The numbers $(n, k)$ of possible arrangements of $n$ letters with $k$ identities as described in the text. At the same time, these numbers are the coefficients of $\alpha_{\rm eff}^{n-k}$ in the $n$-th term on the r.h.s. of eqn. (51). In the main text it is shown that the alternating sum along the diagonals, connected as a guide to the eye, disappears: $\sum_l (-1)^l (n + l, l) = 0$. E.g. 1 - 3 + 2 = 0, 1 - 6 + 10 - 5 = 0 etc.